NSF PAR Search | NSF Public Access Repository

Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling

Ahmad, Sohaib; Guan, Hui; Friedman, Brian; Williams, Thomas; Sitaraman, Ramesh; Woo, Thomas (April 2024, ACM)

Existing machine learning inference-serving systems largely rely on hardware scaling by adding more devices or using more powerful accelerators to handle increasing query demands. However, hardware scaling might not be feasible for fixed-size edge clusters or private clouds due to their limited hardware resources. A viable alternative is accuracy scaling, which adapts the accuracy of ML models instead of hardware resources to handle varying query demands. This work studies the design of a high-throughput inference-serving system with accuracy scaling that can meet throughput requirements while maximizing accuracy. To achieve this goal, this work proposes to identify the right amount of accuracy scaling by jointly optimizing three sub-problems: how to select model variants, how to place them on heterogeneous devices, and how to assign query workloads to each device. It also proposes a new adaptive batching algorithm to handle variations in query arrival times and minimize SLO violations. Based on the proposed techniques, we build an inference-serving system called Proteus and empirically evaluate it on real-world and synthetic traces. We show that Proteus reduces accuracy drop by up to 3× and latency timeouts by 2-10× with respect to baseline schemes, while meeting throughput requirements.
Full Text Available
Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling

Ahmad, Sohaib; Guan, Hui; Friedman, Brian; Williams, Thomas; Sitaraman, Ramesh; Woo, Thomas (April 2024, ASPLOS'24)

Full Text Available
Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling

https://doi.org/10.1145/3617232.3624849

Ahmad, Sohaib; Guan, Hui; Friedman, Brian D; Williams, Thomas; Sitaraman, Ramesh K; Woo, Thomas (April 2024, ACM)

Full Text Available
An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems

https://doi.org/10.1109/SC41405.2020.00094

Wang, Shaoqi; Gonzalez, Oscar J.; Zhou, Xiaobo; Williams, Thomas; Friedman, Brian D.; Havemann, Martin; Woo, Thomas (November 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis)
Full Text Available
